We analyze a cohort of heart failure patients admitted to the CCU, with vital features selected from the MIMIC-III dataset. Several methods are used to predict in-CCU mortality for heart failure patients, including logistic regression, SVM, decision tree, random forest, and boosting models. A preferred model is chosen from these candidates based on confusion matrices, ROC curves, and AUC. We also identify, from an inference model, which features can be used to predict mortality in heart failure patients.
knitr::opts_chunk$set(warning=FALSE, message=FALSE, tidy = TRUE)
setwd("/Users/jill/Documents/614 final assignment/")
library(ggplot2)
library(stats)
library(dplyr)
library(corrplot)
library(psych)
library(rgl)
library(pwr)
library(pscl)
library(ISLR)
library(dlookr)
library(Hmisc)
library(pastecs)
library(car)
library(lattice)
library(caret)
library(rpart.plot)
library(pROC)
library(ROCR)
library(randomForest)
library(kernlab)
library(e1071)
library(ROSE)
hf_df <- read.csv("final_dataset.csv", header = T, stringsAsFactors = F)
summary(hf_df)
## subject_id hadm_id last_careunit age
## Min. : 26 Min. :100199 Length:1335 Min. :41.00
## 1st Qu.:11581 1st Qu.:128783 Class :character 1st Qu.:63.00
## Median :24950 Median :156806 Mode :character Median :73.00
## Mean :34944 Mean :154118 Mean :71.01
## 3rd Qu.:58048 3rd Qu.:179238 3rd Qu.:81.00
## Max. :99982 Max. :199963 Max. :89.00
## status los glucose_num
## Length:1335 Min. : 0.0012 Length:1335
## Class :character 1st Qu.: 1.2787 Class :character
## Mode :character Median : 2.2583 Mode :character
## Mean : 3.3424
## 3rd Qu.: 3.9614
## Max. :52.8108
## sodium_num wbc_count_num calcium_num
## Length:1335 Length:1335 Length:1335
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## hemoglobin_num creatinine_num urea_nitrogen_num
## Length:1335 Length:1335 Length:1335
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## chloride_num pco2_num
## Length:1335 Length:1335
## Class :character Class :character
## Mode :character Mode :character
##
##
##
str(hf_df)
## 'data.frame': 1335 obs. of 15 variables:
## $ subject_id : int 26 42 55 77 163 176 199 209 214 228 ...
## $ hadm_id : int 197661 119203 190665 142768 138528 135828 185360 190711 197273 167764 ...
## $ last_careunit : chr "CCU" "CCU" "CCU" "CCU" ...
## $ age : int 72 61 64 45 80 78 41 73 63 79 ...
## $ status : chr "dead" "dead" "Alive" "Alive" ...
## $ los : num 2.14 1.9 1.91 1.16 1.53 ...
## $ glucose_num : chr "129" "157" "106" "124" ...
## $ sodium_num : chr "143" "139" "144" "139" ...
## $ wbc_count_num : chr "8.2" "9.3" "11.6" "15.7" ...
## $ calcium_num : chr "8.8" "9.2" "9.4" "8.9" ...
## $ hemoglobin_num : chr "12.3" "12.5" "11.3" "13.3" ...
## $ creatinine_num : chr "1.4" "0.7" "0.8" "1.1" ...
## $ urea_nitrogen_num: chr "36" "18" "23" "20" ...
## $ chloride_num : chr "108" "105" "109" "102" ...
## $ pco2_num : chr "NULL" "NULL" "NULL" "52" ...
hf <- hf_df
summary(hf)
## subject_id hadm_id last_careunit age
## Min. : 26 Min. :100199 Length:1335 Min. :41.00
## 1st Qu.:11581 1st Qu.:128783 Class :character 1st Qu.:63.00
## Median :24950 Median :156806 Mode :character Median :73.00
## Mean :34944 Mean :154118 Mean :71.01
## 3rd Qu.:58048 3rd Qu.:179238 3rd Qu.:81.00
## Max. :99982 Max. :199963 Max. :89.00
## status los glucose_num
## Length:1335 Min. : 0.0012 Length:1335
## Class :character 1st Qu.: 1.2787 Class :character
## Mode :character Median : 2.2583 Mode :character
## Mean : 3.3424
## 3rd Qu.: 3.9614
## Max. :52.8108
## sodium_num wbc_count_num calcium_num
## Length:1335 Length:1335 Length:1335
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## hemoglobin_num creatinine_num urea_nitrogen_num
## Length:1335 Length:1335 Length:1335
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## chloride_num pco2_num
## Length:1335 Length:1335
## Class :character Class :character
## Mode :character Mode :character
##
##
##
# Drop identifier variables that carry no predictive information: subject_id, hadm_id, last_careunit.
hf <- hf[,-c(1,2,3)]
## Convert variables to appropriate types.
# status needs to be converted to a factor.
hf$status <- factor(hf$status, levels = c("Alive","dead"))
summary(hf$status)
## Alive dead
## 663 672
# glucose_num, sodium_num, wbc_count_num, calcium_num, hemoglobin_num, creatinine_num, urea_nitrogen_num, chloride_num, and pco2_num are stored as character strings; convert them to numeric (the "NULL" strings become NA).
hf[,-c(1:3)] <- data.frame(lapply(hf[,-c(1:3)],as.numeric))
summary(hf)
## age status los glucose_num
## Min. :41.00 Alive:663 Min. : 0.0012 Min. : 78.0
## 1st Qu.:63.00 dead :672 1st Qu.: 1.2787 1st Qu.: 142.0
## Median :73.00 Median : 2.2583 Median : 181.0
## Mean :71.01 Mean : 3.3424 Mean : 210.9
## 3rd Qu.:81.00 3rd Qu.: 3.9614 3rd Qu.: 245.0
## Max. :89.00 Max. :52.8108 Max. :1601.0
## NA's :11
## sodium_num wbc_count_num calcium_num hemoglobin_num
## Min. :123.0 Min. : 2.500 Min. : 5.400 Min. : 7.90
## 1st Qu.:140.0 1st Qu.: 9.475 1st Qu.: 8.800 1st Qu.:11.30
## Median :142.0 Median : 12.600 Median : 9.200 Median :12.50
## Mean :141.8 Mean : 14.152 Mean : 9.192 Mean :12.68
## 3rd Qu.:144.0 3rd Qu.: 17.000 3rd Qu.: 9.500 3rd Qu.:14.00
## Max. :167.0 Max. :170.300 Max. :23.000 Max. :19.60
## NA's :11 NA's :11 NA's :30 NA's :11
## creatinine_num urea_nitrogen_num chloride_num pco2_num
## Min. : 0.400 Min. : 4.00 Min. : 87.0 Min. : 22.00
## 1st Qu.: 1.100 1st Qu.: 23.00 1st Qu.:103.0 1st Qu.: 40.00
## Median : 1.500 Median : 35.00 Median :106.0 Median : 46.50
## Mean : 2.186 Mean : 44.81 Mean :106.3 Mean : 50.22
## 3rd Qu.: 2.500 3rd Qu.: 58.00 3rd Qu.:109.0 3rd Qu.: 55.00
## Max. :17.800 Max. :272.00 Max. :134.0 Max. :175.00
## NA's :10 NA's :10 NA's :11 NA's :461
# As pco2_num has a lot of NAs, drop this variable.
hf <- subset(hf,select = - pco2_num)
# A few NAs remain in the other numeric variables; they will be imputed after the data split.
summary(hf)
## age status los glucose_num
## Min. :41.00 Alive:663 Min. : 0.0012 Min. : 78.0
## 1st Qu.:63.00 dead :672 1st Qu.: 1.2787 1st Qu.: 142.0
## Median :73.00 Median : 2.2583 Median : 181.0
## Mean :71.01 Mean : 3.3424 Mean : 210.9
## 3rd Qu.:81.00 3rd Qu.: 3.9614 3rd Qu.: 245.0
## Max. :89.00 Max. :52.8108 Max. :1601.0
## NA's :11
## sodium_num wbc_count_num calcium_num hemoglobin_num
## Min. :123.0 Min. : 2.500 Min. : 5.400 Min. : 7.90
## 1st Qu.:140.0 1st Qu.: 9.475 1st Qu.: 8.800 1st Qu.:11.30
## Median :142.0 Median : 12.600 Median : 9.200 Median :12.50
## Mean :141.8 Mean : 14.152 Mean : 9.192 Mean :12.68
## 3rd Qu.:144.0 3rd Qu.: 17.000 3rd Qu.: 9.500 3rd Qu.:14.00
## Max. :167.0 Max. :170.300 Max. :23.000 Max. :19.60
## NA's :11 NA's :11 NA's :30 NA's :11
## creatinine_num urea_nitrogen_num chloride_num
## Min. : 0.400 Min. : 4.00 Min. : 87.0
## 1st Qu.: 1.100 1st Qu.: 23.00 1st Qu.:103.0
## Median : 1.500 Median : 35.00 Median :106.0
## Mean : 2.186 Mean : 44.81 Mean :106.3
## 3rd Qu.: 2.500 3rd Qu.: 58.00 3rd Qu.:109.0
## Max. :17.800 Max. :272.00 Max. :134.0
## NA's :10 NA's :10 NA's :11
lapply(hf[,-2], sd, na.rm = T)
## $age
## [1] 11.25485
##
## $los
## [1] 3.577665
##
## $glucose_num
## [1] 107.9483
##
## $sodium_num
## [1] 3.890178
##
## $wbc_count_num
## [1] 8.61156
##
## $calcium_num
## [1] 0.8023754
##
## $hemoglobin_num
## [1] 1.879497
##
## $creatinine_num
## [1] 1.904505
##
## $urea_nitrogen_num
## [1] 29.19859
##
## $chloride_num
## [1] 5.234303
This dataset has 11 variables: 10 numerical and 1 categorical. We choose status as the response variable and the 10 numerical variables as predictors.
Univariate
q <- ggplot(hf)
# Age distribution colored by mortality
q+geom_density(aes(x=age, fill=status), alpha = 0.8) +
ggtitle("Distribution of Age (density)") + xlab("Age (year)") + ylab("Density")
q+ geom_histogram(aes(x=age, fill=status),bins = 20, position = "dodge") +
ggtitle("Distribution of Age (histogram)") + xlab("Age (year)") + ylab("Frequency")
# Los distribution colored by mortality
q+geom_density(aes(x=los, fill=status), alpha = 0.8) +
ggtitle("Distribution of Los (density)") + xlab("Los (day)") + ylab("Density")
q+ geom_histogram(aes(x=los, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Los (histogram)") + xlab("Los (day)") + ylab("Frequency")
# Glucose_num distribution colored by mortality
q+geom_density(aes(x=glucose_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of glucose_num (density)") + xlab("Glucose (mg/dL)") + ylab("Density")
q+ geom_histogram(aes(x=glucose_num, fill=status),bins = 8, position = "dodge") +
ggtitle("Distribution of glucose_num (histogram)") + xlab("Glucose (mg/dL)") + ylab("Frequency")
# Sodium_num distribution colored by mortality
q+geom_density(aes(x=sodium_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Sodium_num (density)") +
xlab("Sodium_num (mmol/L)") + ylab("Density")
q+ geom_histogram(aes(x=sodium_num, fill=status),bins = 15, position = "dodge") +
ggtitle("Distribution of Sodium_num (histogram)") +
xlab("Sodium_num (mmol/L)") + ylab("Frequency")
# Wbc_count_num distribution colored by mortality
q+geom_density(aes(x=wbc_count_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Wbc_count_num (density)") +
xlab("Wbc_count_num (x 10-3/mL)") + ylab("Density")
q+ geom_histogram(aes(x=wbc_count_num, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Wbc_count_num (histogram)") +
xlab("Wbc_count_num (x 10-3/mL)") + ylab("Frequency")
# Calcium_num distribution colored by mortality
q+geom_density(aes(x=calcium_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Calcium_num (density)") +
xlab("Calcium_num (mg/dL)") + ylab("Density")
q+ geom_histogram(aes(x=calcium_num, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Calcium_num (histogram)") +
xlab("Calcium_num (mg/dL)") + ylab("Frequency")
# Hemoglobin_num distribution colored by mortality
q+geom_density(aes(x=hemoglobin_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Hemoglobin_num (density)") +
xlab("Hemoglobin_num (g/dL)") + ylab("Density")
q+ geom_histogram(aes(x=hemoglobin_num, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Hemoglobin_num (histogram)") +
xlab("Hemoglobin_num (g/dL)") + ylab("Frequency")
# Creatinine_num distribution colored by mortality
q+geom_density(aes(x=creatinine_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Creatinine_num (density)") +
xlab("Creatinine_num (mg/dL)") + ylab("Density")
q+ geom_histogram(aes(x=creatinine_num, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Creatinine_num (histogram)") +
xlab("Creatinine_num (mg/dL)") + ylab("Frequency")
# Urea_nitrogen_num distribution colored by mortality
q+geom_density(aes(x=urea_nitrogen_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Urea_nitrogen_num (density)") +
xlab("Urea_nitrogen_num (mg/dL)") + ylab("Density")
q+ geom_histogram(aes(x=urea_nitrogen_num, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Urea_nitrogen_num (histogram)") +
xlab("Urea_nitrogen_num (mg/dL)") + ylab("Frequency")
# Chloride_num distribution colored by mortality
q+geom_density(aes(x=chloride_num, fill=status), alpha = 0.8) +
ggtitle("Distribution of Chloride_num (density)") +
xlab("Chloride_num (mmol/L)") + ylab("Density")
q+ geom_histogram(aes(x=chloride_num, fill=status),bins = 10, position = "dodge") +
ggtitle("Distribution of Chloride_num (histogram)") +
xlab("Chloride_num (mmol/L)") + ylab("Frequency")
Patients who died in the CCU were older on average than those discharged alive. Length of stay is concentrated in 0-10 days for both groups, with a higher frequency among surviving patients. Patients who died had lower hemoglobin levels on average, and higher glucose, creatinine, and urea nitrogen levels, than those who survived. The distributions of sodium_num, wbc_count_num, calcium_num, and chloride_num are similar between the two groups.
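These visual impressions can be cross-checked numerically with per-group means. A minimal sketch on a small simulated frame with the same column names (with the real data, `demo` would be replaced by `hf` and the remaining lab columns added):

```r
# Cross-check the density-plot impressions with group means by status.
# `demo` is a simulated stand-in for hf; only two columns are shown.
demo <- data.frame(
  status = factor(c("Alive", "Alive", "dead", "dead")),
  age = c(60, 70, 75, 85),
  hemoglobin_num = c(13.0, 14.0, 11.0, 12.0)
)
aggregate(cbind(age, hemoglobin_num) ~ status, data = demo, FUN = mean)
# Alive: age 65, hemoglobin 13.5; dead: age 80, hemoglobin 11.5
```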
Correlation
# Correlation matrix
hf_num <- hf
hf_num$status <- as.numeric(hf_num$status)
c <- cor(hf_num, use = "pairwise.complete.obs", method = "spearman" )
corrplot(c)
# Visualize the relationships "sodium_num & chloride_num" and "creatinine_num & urea_nitrogen_num", colored by status.
q + geom_point(aes(x=sodium_num,y=chloride_num,col=status)) +
ggtitle("Relationship between Sodium_num and Chloride_num") +
xlab("Sodium_num (mmol/L)") + ylab("Chloride_num (mmol/L)")
q + geom_point(aes(x=creatinine_num,y=urea_nitrogen_num,col=status)) +
ggtitle("Relationship between Creatinine_num and Urea_nitrogen_num") +
xlab("Creatinine_num (mg/dL)") + ylab("Urea_nitrogen_num (mg/dL)")
The correlation matrix shows that sodium_num is highly correlated with chloride_num, and creatinine_num with urea_nitrogen_num. The scatter plots confirm the positive relationships within these pairs. If the logistic model fits the data poorly, we may revisit these variables and remove one of each collinear pair.
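The severity of such collinearity can be quantified with a variance inflation factor. A base-R sketch on simulated data (with the real data, `car::vif()` on the full logistic fit would serve the same purpose; `car` is already loaded above):

```r
# VIF for one predictor = 1 / (1 - R^2) from regressing it on the other predictors.
set.seed(1)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.2)   # strongly correlated pair, like sodium/chloride
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1                             # well above the common 5-10 rule of thumb
```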
# Split data
set.seed(3033)
intrain <- createDataPartition(y = hf$status, p= 0.7, list = FALSE)
training <- hf[intrain,]
testing <- hf[-intrain,]
dim(intrain); dim(training); dim(testing)
## [1] 936 1
## [1] 936 11
## [1] 399 11
# Impute NAs - each numeric variable has only a few NAs, so fill them with the column mean.
# Training set
im_train <- training
summary(im_train)
## age status los glucose_num
## Min. :41.00 Alive:465 Min. : 0.0012 Min. : 78.0
## 1st Qu.:63.00 dead :471 1st Qu.: 1.2677 1st Qu.: 143.0
## Median :73.00 Median : 2.2338 Median : 183.0
## Mean :71.01 Mean : 3.3799 Mean : 212.7
## 3rd Qu.:81.00 3rd Qu.: 3.9679 3rd Qu.: 247.0
## Max. :88.00 Max. :52.8108 Max. :1601.0
## NA's :7
## sodium_num wbc_count_num calcium_num hemoglobin_num
## Min. :123.0 Min. : 2.50 Min. : 5.400 Min. : 7.90
## 1st Qu.:140.0 1st Qu.: 9.40 1st Qu.: 8.800 1st Qu.:11.20
## Median :142.0 Median : 12.60 Median : 9.200 Median :12.50
## Mean :141.8 Mean : 14.35 Mean : 9.195 Mean :12.65
## 3rd Qu.:144.0 3rd Qu.: 17.20 3rd Qu.: 9.500 3rd Qu.:14.00
## Max. :167.0 Max. :170.30 Max. :23.000 Max. :18.70
## NA's :7 NA's :7 NA's :19 NA's :7
## creatinine_num urea_nitrogen_num chloride_num
## Min. : 0.400 Min. : 4.00 Min. : 87.0
## 1st Qu.: 1.100 1st Qu.: 23.00 1st Qu.:103.0
## Median : 1.500 Median : 35.00 Median :106.0
## Mean : 2.208 Mean : 44.75 Mean :106.3
## 3rd Qu.: 2.400 3rd Qu.: 59.00 3rd Qu.:110.0
## Max. :17.800 Max. :272.00 Max. :134.0
## NA's :6 NA's :6 NA's :7
im_train <- data.frame(lapply(im_train, function(x) {
if (is.numeric(x)){
x[is.na(x)]<- mean(x,na.rm =T)
}
x
}))
summary(im_train)
## age status los glucose_num
## Min. :41.00 Alive:465 Min. : 0.0012 Min. : 78.0
## 1st Qu.:63.00 dead :471 1st Qu.: 1.2677 1st Qu.: 143.8
## Median :73.00 Median : 2.2338 Median : 183.0
## Mean :71.01 Mean : 3.3799 Mean : 212.7
## 3rd Qu.:81.00 3rd Qu.: 3.9679 3rd Qu.: 246.2
## Max. :88.00 Max. :52.8108 Max. :1601.0
## sodium_num wbc_count_num calcium_num hemoglobin_num
## Min. :123.0 Min. : 2.500 Min. : 5.400 Min. : 7.90
## 1st Qu.:140.0 1st Qu.: 9.475 1st Qu.: 8.800 1st Qu.:11.20
## Median :142.0 Median : 12.600 Median : 9.195 Median :12.50
## Mean :141.8 Mean : 14.350 Mean : 9.195 Mean :12.65
## 3rd Qu.:144.0 3rd Qu.: 17.200 3rd Qu.: 9.500 3rd Qu.:14.00
## Max. :167.0 Max. :170.300 Max. :23.000 Max. :18.70
## creatinine_num urea_nitrogen_num chloride_num
## Min. : 0.400 Min. : 4.00 Min. : 87.0
## 1st Qu.: 1.100 1st Qu.: 23.00 1st Qu.:103.0
## Median : 1.500 Median : 36.00 Median :106.0
## Mean : 2.208 Mean : 44.75 Mean :106.3
## 3rd Qu.: 2.400 3rd Qu.: 59.00 3rd Qu.:110.0
## Max. :17.800 Max. :272.00 Max. :134.0
# Testing set
im_test <- testing
summary(im_test)
## age status los glucose_num
## Min. :43 Alive:198 Min. : 0.2203 Min. : 94.0
## 1st Qu.:63 dead :201 1st Qu.: 1.3042 1st Qu.:141.0
## Median :73 Median : 2.2819 Median :176.0
## Mean :71 Mean : 3.2546 Mean :206.7
## 3rd Qu.:81 3rd Qu.: 3.9510 3rd Qu.:239.5
## Max. :89 Max. :30.8874 Max. :798.0
## NA's :4
## sodium_num wbc_count_num calcium_num hemoglobin_num
## Min. :128.0 Min. : 4.30 Min. : 6.900 Min. : 8.40
## 1st Qu.:140.0 1st Qu.: 9.55 1st Qu.: 8.800 1st Qu.:11.30
## Median :142.0 Median :12.70 Median : 9.100 Median :12.50
## Mean :141.8 Mean :13.69 Mean : 9.185 Mean :12.75
## 3rd Qu.:144.0 3rd Qu.:16.35 3rd Qu.: 9.500 3rd Qu.:14.10
## Max. :155.0 Max. :36.50 Max. :14.100 Max. :19.60
## NA's :4 NA's :4 NA's :11 NA's :4
## creatinine_num urea_nitrogen_num chloride_num
## Min. : 0.400 Min. : 8.00 Min. : 91.0
## 1st Qu.: 1.100 1st Qu.: 24.00 1st Qu.:103.0
## Median : 1.500 Median : 35.00 Median :106.0
## Mean : 2.135 Mean : 44.94 Mean :106.2
## 3rd Qu.: 2.500 3rd Qu.: 57.00 3rd Qu.:109.0
## Max. :11.800 Max. :156.00 Max. :125.0
## NA's :4 NA's :4 NA's :4
im_test <- data.frame(lapply(im_test, function(x) {
if (is.numeric(x)){
x[is.na(x)]<- mean(x,na.rm =T)
}
x
}))
summary(im_test)
## age status los glucose_num
## Min. :43 Alive:198 Min. : 0.2203 Min. : 94.0
## 1st Qu.:63 dead :201 1st Qu.: 1.3042 1st Qu.:141.0
## Median :73 Median : 2.2819 Median :178.0
## Mean :71 Mean : 3.2546 Mean :206.7
## 3rd Qu.:81 3rd Qu.: 3.9510 3rd Qu.:238.5
## Max. :89 Max. :30.8874 Max. :798.0
## sodium_num wbc_count_num calcium_num hemoglobin_num
## Min. :128.0 Min. : 4.30 Min. : 6.900 Min. : 8.40
## 1st Qu.:140.0 1st Qu.: 9.60 1st Qu.: 8.800 1st Qu.:11.30
## Median :142.0 Median :12.80 Median : 9.185 Median :12.50
## Mean :141.8 Mean :13.69 Mean : 9.185 Mean :12.75
## 3rd Qu.:144.0 3rd Qu.:16.25 3rd Qu.: 9.500 3rd Qu.:14.10
## Max. :155.0 Max. :36.50 Max. :14.100 Max. :19.60
## creatinine_num urea_nitrogen_num chloride_num
## Min. : 0.400 Min. : 8.00 Min. : 91.0
## 1st Qu.: 1.100 1st Qu.: 24.00 1st Qu.:103.0
## Median : 1.500 Median : 36.00 Median :106.0
## Mean : 2.135 Mean : 44.94 Mean :106.2
## 3rd Qu.: 2.500 3rd Qu.: 57.00 3rd Qu.:109.0
## Max. :11.800 Max. :156.00 Max. :125.0
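Note that above each set is imputed with its own means. A stricter alternative, sketched here on toy vectors, imputes test-set NAs with the training means so that no information flows from the test set into preprocessing (a suggested variant, not what was run above):

```r
# Fill a test-set NA with the training-set mean, not the test-set mean.
train_x <- c(1, 2, 3, NA)
test_x  <- c(NA, 10)
mu <- mean(train_x, na.rm = TRUE)    # training mean = 2
test_x[is.na(test_x)] <- mu
test_x                               # 2 10
```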
fit_main <- glm(status ~ ., data = im_train, family=binomial)
fit_null <- glm(status ~ 1, data = im_train, family=binomial)
fit_final <-step(fit_null, scope=list(lower=fit_null, upper=fit_main),direction="both")
## Start: AIC=1299.53
## status ~ 1
##
## Df Deviance AIC
## + urea_nitrogen_num 1 1168.6 1172.6
## + age 1 1198.8 1202.8
## + creatinine_num 1 1235.3 1239.3
## + hemoglobin_num 1 1257.8 1261.8
## + wbc_count_num 1 1271.3 1275.3
## + glucose_num 1 1285.6 1289.6
## + los 1 1287.8 1291.8
## <none> 1297.5 1299.5
## + sodium_num 1 1295.6 1299.6
## + calcium_num 1 1296.6 1300.6
## + chloride_num 1 1297.0 1301.0
##
## Step: AIC=1172.63
## status ~ urea_nitrogen_num
##
## Df Deviance AIC
## + age 1 1090.3 1096.3
## + wbc_count_num 1 1151.7 1157.7
## + hemoglobin_num 1 1152.8 1158.8
## <none> 1168.6 1172.6
## + calcium_num 1 1167.1 1173.1
## + glucose_num 1 1167.5 1173.5
## + los 1 1167.8 1173.8
## + sodium_num 1 1168.4 1174.4
## + creatinine_num 1 1168.5 1174.5
## + chloride_num 1 1168.6 1174.6
## - urea_nitrogen_num 1 1297.5 1299.5
##
## Step: AIC=1096.31
## status ~ urea_nitrogen_num + age
##
## Df Deviance AIC
## + wbc_count_num 1 1069.5 1077.5
## + hemoglobin_num 1 1085.6 1093.6
## + los 1 1087.2 1095.2
## + glucose_num 1 1087.2 1095.2
## + creatinine_num 1 1087.3 1095.3
## <none> 1090.3 1096.3
## + chloride_num 1 1089.9 1097.9
## + calcium_num 1 1090.2 1098.2
## + sodium_num 1 1090.3 1098.3
## - age 1 1168.6 1172.6
## - urea_nitrogen_num 1 1198.8 1202.8
##
## Step: AIC=1077.47
## status ~ urea_nitrogen_num + age + wbc_count_num
##
## Df Deviance AIC
## + hemoglobin_num 1 1061.6 1071.6
## + creatinine_num 1 1066.8 1076.8
## + chloride_num 1 1067.2 1077.2
## <none> 1069.5 1077.5
## + sodium_num 1 1068.9 1078.9
## + glucose_num 1 1068.9 1078.9
## + los 1 1069.0 1079.0
## + calcium_num 1 1069.3 1079.3
## - wbc_count_num 1 1090.3 1096.3
## - age 1 1151.7 1157.7
## - urea_nitrogen_num 1 1167.1 1173.1
##
## Step: AIC=1071.64
## status ~ urea_nitrogen_num + age + wbc_count_num + hemoglobin_num
##
## Df Deviance AIC
## + chloride_num 1 1059.2 1071.2
## <none> 1061.6 1071.6
## + creatinine_num 1 1060.0 1072.0
## + los 1 1060.9 1072.9
## + glucose_num 1 1061.1 1073.1
## + sodium_num 1 1061.3 1073.3
## + calcium_num 1 1061.6 1073.6
## - hemoglobin_num 1 1069.5 1077.5
## - wbc_count_num 1 1085.6 1093.6
## - age 1 1131.6 1139.6
## - urea_nitrogen_num 1 1143.1 1151.1
##
## Step: AIC=1071.21
## status ~ urea_nitrogen_num + age + wbc_count_num + hemoglobin_num +
## chloride_num
##
## Df Deviance AIC
## <none> 1059.2 1071.2
## - chloride_num 1 1061.6 1071.6
## + los 1 1057.8 1071.8
## + creatinine_num 1 1058.0 1072.0
## + glucose_num 1 1058.7 1072.7
## + sodium_num 1 1058.8 1072.8
## + calcium_num 1 1059.2 1073.2
## - hemoglobin_num 1 1067.2 1077.2
## - wbc_count_num 1 1085.2 1095.2
## - age 1 1130.9 1140.9
## - urea_nitrogen_num 1 1136.7 1146.7
summary(fit_final) # chloride_num's p-value > 0.05, so it is not statistically significant; remove this variable.
##
## Call:
## glm(formula = status ~ urea_nitrogen_num + age + wbc_count_num +
## hemoglobin_num + chloride_num, family = binomial, data = im_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.83477 -0.95015 0.08639 0.95255 2.15081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.196087 1.703489 -1.289 0.19734
## urea_nitrogen_num 0.025854 0.003225 8.017 1.08e-15 ***
## age 0.058539 0.007268 8.054 8.03e-16 ***
## wbc_count_num 0.061793 0.013705 4.509 6.52e-06 ***
## hemoglobin_num -0.122085 0.043450 -2.810 0.00496 **
## chloride_num -0.022570 0.014499 -1.557 0.11956
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1297.5 on 935 degrees of freedom
## Residual deviance: 1059.2 on 930 degrees of freedom
## AIC: 1071.2
##
## Number of Fisher Scoring iterations: 5
fit_final2 <- glm(status ~ urea_nitrogen_num + age + wbc_count_num +
hemoglobin_num, data = im_train, family=binomial)
summary(fit_final2)
##
## Call:
## glm(formula = status ~ urea_nitrogen_num + age + wbc_count_num +
## hemoglobin_num, family = binomial, data = im_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.78168 -0.96077 0.08109 0.96166 2.11101
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.508316 0.847339 -5.321 1.03e-07 ***
## urea_nitrogen_num 0.026390 0.003216 8.205 2.31e-16 ***
## age 0.057464 0.007210 7.970 1.58e-15 ***
## wbc_count_num 0.057886 0.013362 4.332 1.48e-05 ***
## hemoglobin_num -0.120442 0.043326 -2.780 0.00544 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1297.5 on 935 degrees of freedom
## Residual deviance: 1061.6 on 931 degrees of freedom
## AIC: 1071.6
##
## Number of Fisher Scoring iterations: 5
pR2(fit_final2)
## llh llhNull G2 McFadden r2ML
## -530.8223568 -648.7665301 235.8883465 0.1817976 0.2227688
## r2CU
## 0.2970292
pred_glm <- predict(fit_final2, im_test, type = "response")  # predicted probabilities
df_compare_glm <- data.frame(pred_glm, im_test$status)
# View the comparison data frame df_compare_glm if desired
plot(df_compare_glm)
test_pred <- ifelse(pred_glm >=0.5, "dead","Alive")
test_pred <- factor(test_pred, levels = c("Alive","dead"))
confusionMatrix(test_pred, im_test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 167 101
## dead 31 100
##
## Accuracy : 0.6692
## 95% CI : (0.6206, 0.7152)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 1.749e-11
##
## Kappa : 0.34
##
## Mcnemar's Test P-Value : 1.905e-09
##
## Sensitivity : 0.8434
## Specificity : 0.4975
## Pos Pred Value : 0.6231
## Neg Pred Value : 0.7634
## Prevalence : 0.4962
## Detection Rate : 0.4185
## Detection Prevalence : 0.6717
## Balanced Accuracy : 0.6705
##
## 'Positive' Class : Alive
##
# ROC curve
p_log <- predict(fit_final2, newdata=im_test, type="response")
pr_log <- prediction(p_log, im_test$status)
prf_log <- performance(pr_log, measure = "tpr", x.measure = "fpr")
par(pty = "s")
plot(prf_log, colorize = T,
main="ROC curve of logistic regression")
abline(a=0, b=1)
auc_log <- performance(pr_log, measure = "auc")
auc_log <- round(auc_log@y.values[[1]],5)
legend(.6,.2,auc_log,title="AUC",cex = .8)
The final model includes 4 predictors, each with a p-value far below 0.05: urea_nitrogen_num, age, wbc_count_num, and hemoglobin_num. The equation is: logit(P) = -4.51 + 0.026*urea_nitrogen_num + 0.057*age + 0.058*wbc_count_num - 0.12*hemoglobin_num
For each 1 mg/dL increase in urea_nitrogen_num, the odds of mortality for heart failure patients are multiplied by 1.026 [exp(0.026)] on average, holding the other variables constant.
For each 1 year increase in age, the odds of mortality are multiplied by 1.059 [exp(0.057)] on average, holding the other variables constant.
For each 1 x 10-3/mL increase in wbc_count_num, the odds of mortality are multiplied by 1.06 [exp(0.058)] on average, holding the other variables constant.
For each 1 g/dL increase in hemoglobin_num, the odds of mortality are multiplied by 0.88 [exp(-0.12)] on average, holding the other variables constant.
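These odds ratios come from exponentiating the fitted coefficients; with the model above this is one line, `exp(coef(fit_final2))` (and `exp(confint(fit_final2))` for 95% intervals). A self-contained sketch on simulated data:

```r
# Exponentiated logistic-regression coefficients are odds ratios.
set.seed(2)
x <- rnorm(300)
y <- rbinom(300, 1, plogis(-0.5 + 0.4 * x))
fit <- glm(y ~ x, family = binomial)
exp(coef(fit))   # odds-ratio scale; the entry for x should land near exp(0.4) ~ 1.49
```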
In the pR2 results, McFadden's R-squared is 0.182, indicating a modest fit: the model explains only about 18% of the variation in the response. Likely reasons are the limited sample size and the possibility that the predictors do not include all of the features important for heart failure mortality. The plot of df_compare_glm confirms that the model does not predict well: the points are widely scattered relative to the true values.
The confusion matrix indicates that the model correctly identified 100 of the 201 patients who died, and 167 of the 198 who survived. Although the model predicts survivors well, predicting deaths is more important and meaningful for clinicians and healthcare.
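Since identifying patients who die matters more here, one simple lever is the classification threshold: lowering it below 0.5 trades specificity for sensitivity on the dead class. A sketch with made-up predicted probabilities (in the report, `pred_glm` would take their place):

```r
# Effect of the decision threshold on sensitivity for the "dead" class.
probs <- c(0.20, 0.42, 0.47, 0.60, 0.80)
truth <- c("Alive", "dead", "dead", "dead", "dead")
for (thr in c(0.5, 0.4)) {
  pred <- ifelse(probs >= thr, "dead", "Alive")
  sens <- mean(pred[truth == "dead"] == "dead")
  cat(sprintf("threshold %.1f: sensitivity = %.2f\n", thr, sens))
}
# threshold 0.5: sensitivity = 0.50
# threshold 0.4: sensitivity = 1.00
```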
# Center and scale the predictors (via preProcess) before training
trctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(3233)
svm_linear <- train(status ~., data =im_train, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10,
metric = "ROC")
svm_linear
## Support Vector Machines with Linear Kernel
##
## 936 samples
## 10 predictor
## 2 classes: 'Alive', 'dead'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ...
## Resampling results:
##
## ROC Sens Spec
## 0.7681942 0.69319 0.6801344
##
## Tuning parameter 'C' was held constant at a value of 1
svm_linear$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 1
##
## Linear (vanilla) kernel function.
##
## Number of Support Vectors : 637
##
## Objective Function Value : -630.6096
## Training error : 0.306624
## Probability model included.
test_pred_sl<- predict(svm_linear, newdata = im_test)
confusionMatrix(test_pred_sl, im_test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 136 58
## dead 62 143
##
## Accuracy : 0.6992
## 95% CI : (0.6516, 0.7439)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 1.778e-15
##
## Kappa : 0.3984
##
## Mcnemar's Test P-Value : 0.7842
##
## Sensitivity : 0.6869
## Specificity : 0.7114
## Pos Pred Value : 0.7010
## Neg Pred Value : 0.6976
## Prevalence : 0.4962
## Detection Rate : 0.3409
## Detection Prevalence : 0.4862
## Balanced Accuracy : 0.6992
##
## 'Positive' Class : Alive
##
par(pty = "s")
test_pred_sl2 <- predict(svm_linear, newdata = im_test,type="prob")
roc(im_test$status,test_pred_sl2[,2],plot=T,legacy.axes = T, col="purple",
main="ROC Curve of SVMLinear Model", xlab="FPR", ylab = "TPR",
print.auc = T, print.auc.x=0.4,print.auc.y=0.3)
##
## Call:
## roc.default(response = im_test$status, predictor = test_pred_sl2[, 2], plot = T, legacy.axes = T, col = "purple", main = "ROC Curve of SVMLinear Model", xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4, print.auc.y = 0.3)
##
## Data: test_pred_sl2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7477
# SVM Polynomial Model
set.seed(3233)
# Reuse the 5-fold repeated-CV control object trctrl defined above.
svm_poly <- train(status ~., data =im_train, method = "svmPoly",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 3,
metric = "ROC")
svm_poly
## Support Vector Machines with Polynomial Kernel
##
## 936 samples
## 10 predictor
## 2 classes: 'Alive', 'dead'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ...
## Resampling results across tuning parameters:
##
## degree scale C ROC Sens Spec
## 1 0.001 0.25 0.7658440 0.9362007 0.3029041
## 1 0.001 0.50 0.7658516 0.9347670 0.2994102
## 1 0.001 1.00 0.7667078 0.7125448 0.6531616
## 1 0.010 0.25 0.7661719 0.6551971 0.7260918
## 1 0.010 0.50 0.7681771 0.6716846 0.7069802
## 1 0.010 1.00 0.7678975 0.6845878 0.6921463
## 1 0.100 0.25 0.7682457 0.6817204 0.6907428
## 1 0.100 0.50 0.7691454 0.6860215 0.6864875
## 1 0.100 1.00 0.7679275 0.6896057 0.6843897
## 2 0.001 0.25 0.7659888 0.9354839 0.3000971
## 2 0.001 0.50 0.7668291 0.7168459 0.6453453
## 2 0.001 1.00 0.7662854 0.6494624 0.7367301
## 2 0.010 0.25 0.7707176 0.6724014 0.7112430
## 2 0.010 0.50 0.7706477 0.6910394 0.6815230
## 2 0.010 1.00 0.7707136 0.6888889 0.6871892
## 2 0.100 0.25 0.7658659 0.6967742 0.6751474
## 2 0.100 0.50 0.7603310 0.6967742 0.6779993
## 2 0.100 1.00 0.7566650 0.6989247 0.6730422
## 3 0.001 0.25 0.7664133 0.9146953 0.3466891
## 3 0.001 0.50 0.7663531 0.6666667 0.7077044
## 3 0.001 1.00 0.7673892 0.6645161 0.7126316
## 3 0.010 0.25 0.7709464 0.6896057 0.6815155
## 3 0.010 0.50 0.7709002 0.6967742 0.6822396
## 3 0.010 1.00 0.7706426 0.6874552 0.6857783
## 3 0.100 0.25 0.7465041 0.6724014 0.6751922
## 3 0.100 0.50 0.7382985 0.6810036 0.6575215
## 3 0.100 1.00 0.7268847 0.6767025 0.6526017
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.01 and C
## = 0.25.
svm_poly$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 0.25
##
## Polynomial kernel function.
## Hyperparameters : degree = 3 scale = 0.01 offset = 1
##
## Number of Support Vectors : 715
##
## Objective Function Value : -167.6995
## Training error : 0.300214
## Probability model included.
test_pred_sp <- predict(svm_poly, newdata = im_test)
confusionMatrix(test_pred_sp, im_test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 140 65
## dead 58 136
##
## Accuracy : 0.6917
## 95% CI : (0.6439, 0.7367)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 2.067e-14
##
## Kappa : 0.3836
##
## Mcnemar's Test P-Value : 0.5885
##
## Sensitivity : 0.7071
## Specificity : 0.6766
## Pos Pred Value : 0.6829
## Neg Pred Value : 0.7010
## Prevalence : 0.4962
## Detection Rate : 0.3509
## Detection Prevalence : 0.5138
## Balanced Accuracy : 0.6918
##
## 'Positive' Class : Alive
##
par(pty = "s")
test_pred_sp2 <- predict(svm_poly, newdata = im_test,type="prob")
roc(im_test$status,test_pred_sp2[,2],plot=T,legacy.axes = T, col="blue",
main="ROC Curve of SVMPoly Model", xlab="FPR", ylab = "TPR",
print.auc = T, print.auc.x=0.4,print.auc.y=0.3)
##
## Call:
## roc.default(response = im_test$status, predictor = test_pred_sp2[, 2], plot = T, legacy.axes = T, col = "blue", main = "ROC Curve of SVMPoly Model", xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4, print.auc.y = 0.3)
##
## Data: test_pred_sp2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7418
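The confusion matrix above uses the default 0.5 probability cut-off, which is not necessarily the best operating point on the ROC curve. A brief sketch (assuming the `pROC` objects already built above, with the fitted curve saved to a hypothetical `roc_sp` variable) of how an alternative threshold could be chosen by Youden's index:

```r
# Hypothetical sketch: locate the threshold maximizing Youden's J
# (sensitivity + specificity - 1) on the SVM-poly ROC curve.
library(pROC)
roc_sp <- roc(im_test$status, test_pred_sp2[, 2])
best <- coords(roc_sp, x = "best", best.method = "youden",
               ret = c("threshold", "sensitivity", "specificity"))
best  # operating point that best balances the two error rates
```

In a clinical screening setting one might instead weight sensitivity for death more heavily; `coords()` also accepts explicit thresholds for that kind of trade-off exploration.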
# SVM RBF Model
set.seed(3233)
svm_RBF <- train(status ~., data =im_train, method = "svmRadial",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10,
metric = "ROC")
svm_RBF
## Support Vector Machines with Radial Basis Function Kernel
##
## 936 samples
## 10 predictor
## 2 classes: 'Alive', 'dead'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.7628504 0.6551971 0.7162150
## 0.50 0.7595249 0.6609319 0.7190444
## 1.00 0.7529735 0.6566308 0.7063233
## 2.00 0.7445889 0.6681004 0.6907951
## 4.00 0.7316517 0.6559140 0.6872340
## 8.00 0.7125140 0.6415771 0.6702426
## 16.00 0.6942941 0.6408602 0.6524897
## 32.00 0.6773674 0.6344086 0.6341172
## 64.00 0.6662330 0.6229391 0.6221202
## 128.00 0.6598725 0.6121864 0.6192982
##
## Tuning parameter 'sigma' was held constant at a value of 0.1173631
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1173631 and C = 0.25.
svm_RBF$finalModel
## Support Vector Machine object of class "ksvm"
##
## SV type: C-svc (classification)
## parameter : cost C = 0.25
##
## Gaussian Radial Basis kernel function.
## Hyperparameter : sigma = 0.11736310351346
##
## Number of Support Vectors : 699
##
## Objective Function Value : -155.1019
## Training error : 0.287393
## Probability model included.
test_pred_sr <- predict(svm_RBF, newdata = im_test)
confusionMatrix(test_pred_sr, im_test$status)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 132 63
## dead 66 138
##
## Accuracy : 0.6767
## 95% CI : (0.6284, 0.7224)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 2.047e-12
##
## Kappa : 0.3533
##
## Mcnemar's Test P-Value : 0.8602
##
## Sensitivity : 0.6667
## Specificity : 0.6866
## Pos Pred Value : 0.6769
## Neg Pred Value : 0.6765
## Prevalence : 0.4962
## Detection Rate : 0.3308
## Detection Prevalence : 0.4887
## Balanced Accuracy : 0.6766
##
## 'Positive' Class : Alive
##
par(pty = "s")
test_pred_sr2 <- predict(svm_RBF, newdata = im_test,type="prob")
roc(im_test$status,test_pred_sr2[,2],plot=T,legacy.axes = T, col="red",
main="ROC Curve of SVMRBF Model", xlab="FPR", ylab = "TPR",
print.auc = T, print.auc.x=0.4,print.auc.y=0.3)
##
## Call:
## roc.default(response = im_test$status, predictor = test_pred_sr2[, 2], plot = T, legacy.axes = T, col = "red", main = "ROC Curve of SVMRBF Model", xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4, print.auc.y = 0.3)
##
## Data: test_pred_sr2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.746
set.seed(3233)
dtree_fit <- train(status ~., data = im_train, method = "rpart",
parms = list(split = "information"),
trControl=trctrl,
tuneLength = 10,
metric = "ROC")
dtree_fit
## CART
##
## 936 samples
## 10 predictor
## 2 classes: 'Alive', 'dead'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.001075269 0.7046784 0.6537634 0.6235685
## 0.002150538 0.7042373 0.6602151 0.6313475
## 0.003584229 0.7038417 0.6401434 0.6730123
## 0.005017921 0.7046679 0.6501792 0.6935424
## 0.005376344 0.7120532 0.6523297 0.7020157
## 0.006451613 0.7131921 0.6422939 0.7176334
## 0.007526882 0.7155879 0.6322581 0.7289586
## 0.008602151 0.7209940 0.6243728 0.7445166
## 0.032258065 0.7047283 0.7132616 0.6580814
## 0.363440860 0.6134690 0.5118280 0.7151101
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.008602151.
test_pred_dt <- predict(dtree_fit, newdata = im_test)
confusionMatrix(test_pred_dt, im_test$status )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 109 39
## dead 89 162
##
## Accuracy : 0.6792
## 95% CI : (0.6309, 0.7248)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 9.791e-13
##
## Kappa : 0.3571
##
## Mcnemar's Test P-Value : 1.484e-05
##
## Sensitivity : 0.5505
## Specificity : 0.8060
## Pos Pred Value : 0.7365
## Neg Pred Value : 0.6454
## Prevalence : 0.4962
## Detection Rate : 0.2732
## Detection Prevalence : 0.3709
## Balanced Accuracy : 0.6782
##
## 'Positive' Class : Alive
##
prp(dtree_fit$finalModel, box.palette = "Reds")
par(pty = "s")
test_pred_dt2 <- predict(dtree_fit, newdata = im_test,type="prob")
roc(im_test$status,test_pred_dt2[,2], plot=T,legacy.axes = T, col="orange",
main="ROC Curve of Decision Tree Model", xlab="FPR", ylab = "TPR",
print.auc = T, print.auc.x=0.4,print.auc.y=0.3)
##
## Call:
## roc.default(response = im_test$status, predictor = test_pred_dt2[, 2], plot = T, legacy.axes = T, col = "orange", main = "ROC Curve of Decision Tree Model", xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4, print.auc.y = 0.3)
##
## Data: test_pred_dt2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7005
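To back the tree plot up with numbers, caret can report scaled variable importance for the fitted rpart model; a short sketch using the `dtree_fit` object above:

```r
# Variable importance for the tuned CART model (caret scales scores to 0-100).
library(caret)
dt_imp <- varImp(dtree_fit)
print(dt_imp)
plot(dt_imp, top = 10, main = "Decision Tree Variable Importance")
```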
set.seed(3233)
rf_fit <- train(status~., data=im_train, method="rf",
preProcess = c("center", "scale"),
trControl=trctrl,
tuneLength = 10,
metric = "ROC")
## note: only 9 unique complexity parameters in default grid. Truncating the grid to 9 .
rf_fit
## Random Forest
##
## 936 samples
## 10 predictor
## 2 classes: 'Alive', 'dead'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 748, 749, 749, 749, 749, 749, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.7659809 0.6666667 0.7077566
## 3 0.7653163 0.6695341 0.7027996
## 4 0.7635464 0.6645161 0.7042031
## 5 0.7649400 0.6702509 0.7084658
## 6 0.7607980 0.6645161 0.7055991
## 7 0.7608882 0.6716846 0.7013438
## 8 0.7605067 0.6695341 0.7034789
## 9 0.7603504 0.6709677 0.6942516
## 10 0.7593865 0.6716846 0.7063009
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
rf_fit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 30.34%
## Confusion matrix:
## Alive dead class.error
## Alive 308 157 0.3376344
## dead 127 344 0.2696391
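Random forests carry their own importance measure (mean decrease in Gini impurity), which can be pulled from the final randomForest object; a sketch assuming the `rf_fit` object above:

```r
# Mean decrease in Gini impurity per predictor, sorted descending.
library(randomForest)
rf_imp <- importance(rf_fit$finalModel)
rf_imp[order(rf_imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]
varImpPlot(rf_fit$finalModel, main = "Random Forest Variable Importance")
```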
test_pred_rf <- predict(rf_fit, newdata = im_test)
confusionMatrix(test_pred_rf, im_test$status )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 126 50
## dead 72 151
##
## Accuracy : 0.6942
## 95% CI : (0.6465, 0.7391)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 9.231e-15
##
## Kappa : 0.3879
##
## Mcnemar's Test P-Value : 0.05727
##
## Sensitivity : 0.6364
## Specificity : 0.7512
## Pos Pred Value : 0.7159
## Neg Pred Value : 0.6771
## Prevalence : 0.4962
## Detection Rate : 0.3158
## Detection Prevalence : 0.4411
## Balanced Accuracy : 0.6938
##
## 'Positive' Class : Alive
##
par(pty = "s")
test_pred_rf2 <- predict(rf_fit, newdata = im_test,type="prob")
roc(im_test$status,test_pred_rf2[,2], plot=T,legacy.axes = T, col="brown",
main="ROC Curve of Random Forest Model", xlab="FPR", ylab = "TPR",
print.auc = T, print.auc.x=0.4,print.auc.y=0.3)

##
## Call:
## roc.default(response = im_test$status, predictor = test_pred_rf2[, 2], plot = T, legacy.axes = T, col = "brown", main = "ROC Curve of Random Forest Model", xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4, print.auc.y = 0.3)
##
## Data: test_pred_rf2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7596
set.seed(3233)
trctrl_xg <- trainControl(method = "repeatedcv", number = 4, repeats = 3,classProbs = TRUE,
summaryFunction = twoClassSummary)
xgb_fit <- train(status~., data=im_train, method="xgbTree",
preProcess = c("center", "scale"),
trControl=trctrl_xg,
tuneLength = 4,
metric = "ROC")
xgb_fit
## eXtreme Gradient Boosting
##
## 936 samples
## 10 predictor
## 2 classes: 'Alive', 'dead'
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (4 fold, repeated 3 times)
## Summary of sample sizes: 701, 702, 702, 703, 702, 702, ...
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree subsample nrounds ROC
## 0.3 1 0.6 0.5000000 50 0.7771503
## 0.3 1 0.6 0.5000000 100 0.7691565
## 0.3 1 0.6 0.5000000 150 0.7611213
## 0.3 1 0.6 0.5000000 200 0.7584341
## 0.3 1 0.6 0.6666667 50 0.7773957
## 0.3 1 0.6 0.6666667 100 0.7730846
## 0.3 1 0.6 0.6666667 150 0.7679715
## 0.3 1 0.6 0.6666667 200 0.7601844
## 0.3 1 0.6 0.8333333 50 0.7821540
## 0.3 1 0.6 0.8333333 100 0.7727647
## 0.3 1 0.6 0.8333333 150 0.7693853
## 0.3 1 0.6 0.8333333 200 0.7645494
## 0.3 1 0.6 1.0000000 50 0.7841870
## 0.3 1 0.6 1.0000000 100 0.7789885
## 0.3 1 0.6 1.0000000 150 0.7735415
## 0.3 1 0.6 1.0000000 200 0.7717717
## 0.3 1 0.8 0.5000000 50 0.7720794
## 0.3 1 0.8 0.5000000 100 0.7644588
## 0.3 1 0.8 0.5000000 150 0.7595396
## 0.3 1 0.8 0.5000000 200 0.7573526
## 0.3 1 0.8 0.6666667 50 0.7774807
## 0.3 1 0.8 0.6666667 100 0.7699465
## 0.3 1 0.8 0.6666667 150 0.7618048
## 0.3 1 0.8 0.6666667 200 0.7593121
## 0.3 1 0.8 0.8333333 50 0.7810612
## 0.3 1 0.8 0.8333333 100 0.7750548
## 0.3 1 0.8 0.8333333 150 0.7673825
## 0.3 1 0.8 0.8333333 200 0.7629095
## 0.3 1 0.8 1.0000000 50 0.7847730
## 0.3 1 0.8 1.0000000 100 0.7788198
## 0.3 1 0.8 1.0000000 150 0.7752112
## 0.3 1 0.8 1.0000000 200 0.7724614
## 0.3 2 0.6 0.5000000 50 0.7552393
## 0.3 2 0.6 0.5000000 100 0.7461319
## 0.3 2 0.6 0.5000000 150 0.7399412
## 0.3 2 0.6 0.5000000 200 0.7380716
## 0.3 2 0.6 0.6666667 50 0.7636822
## 0.3 2 0.6 0.6666667 100 0.7506320
## 0.3 2 0.6 0.6666667 150 0.7451769
## 0.3 2 0.6 0.6666667 200 0.7395856
## 0.3 2 0.6 0.8333333 50 0.7702199
## 0.3 2 0.6 0.8333333 100 0.7550748
## 0.3 2 0.6 0.8333333 150 0.7445450
## 0.3 2 0.6 0.8333333 200 0.7400318
## 0.3 2 0.6 1.0000000 50 0.7701609
## 0.3 2 0.6 1.0000000 100 0.7561301
## 0.3 2 0.6 1.0000000 150 0.7489352
## 0.3 2 0.6 1.0000000 200 0.7448596
## 0.3 2 0.8 0.5000000 50 0.7558514
## 0.3 2 0.8 0.5000000 100 0.7403381
## 0.3 2 0.8 0.5000000 150 0.7327514
## 0.3 2 0.8 0.5000000 200 0.7252684
## 0.3 2 0.8 0.6666667 50 0.7579367
## 0.3 2 0.8 0.6666667 100 0.7537244
## 0.3 2 0.8 0.6666667 150 0.7453853
## 0.3 2 0.8 0.6666667 200 0.7415638
## 0.3 2 0.8 0.8333333 50 0.7671362
## 0.3 2 0.8 0.8333333 100 0.7533784
## 0.3 2 0.8 0.8333333 150 0.7447734
## 0.3 2 0.8 0.8333333 200 0.7393841
## 0.3 2 0.8 1.0000000 50 0.7727110
## 0.3 2 0.8 1.0000000 100 0.7556568
## 0.3 2 0.8 1.0000000 150 0.7479787
## 0.3 2 0.8 1.0000000 200 0.7420088
## 0.3 3 0.6 0.5000000 50 0.7480872
## 0.3 3 0.6 0.5000000 100 0.7376333
## 0.3 3 0.6 0.5000000 150 0.7357309
## 0.3 3 0.6 0.5000000 200 0.7308237
## 0.3 3 0.6 0.6666667 50 0.7473218
## 0.3 3 0.6 0.6666667 100 0.7402820
## 0.3 3 0.6 0.6666667 150 0.7378992
## 0.3 3 0.6 0.6666667 200 0.7347381
## 0.3 3 0.6 0.8333333 50 0.7576240
## 0.3 3 0.6 0.8333333 100 0.7487370
## 0.3 3 0.6 0.8333333 150 0.7443169
## 0.3 3 0.6 0.8333333 200 0.7416555
## 0.3 3 0.6 1.0000000 50 0.7642371
## 0.3 3 0.6 1.0000000 100 0.7483240
## 0.3 3 0.6 1.0000000 150 0.7450523
## 0.3 3 0.6 1.0000000 200 0.7437489
## 0.3 3 0.8 0.5000000 50 0.7481361
## 0.3 3 0.8 0.5000000 100 0.7372225
## 0.3 3 0.8 0.5000000 150 0.7314623
## 0.3 3 0.8 0.5000000 200 0.7291811
## 0.3 3 0.8 0.6666667 50 0.7517104
## 0.3 3 0.8 0.6666667 100 0.7418266
## 0.3 3 0.8 0.6666667 150 0.7372018
## 0.3 3 0.8 0.6666667 200 0.7342139
## 0.3 3 0.8 0.8333333 50 0.7571534
## 0.3 3 0.8 0.8333333 100 0.7485562
## 0.3 3 0.8 0.8333333 150 0.7438218
## 0.3 3 0.8 0.8333333 200 0.7402906
## 0.3 3 0.8 1.0000000 50 0.7621893
## 0.3 3 0.8 1.0000000 100 0.7474452
## 0.3 3 0.8 1.0000000 150 0.7437951
## 0.3 3 0.8 1.0000000 200 0.7418871
## 0.3 4 0.6 0.5000000 50 0.7494025
## 0.3 4 0.6 0.5000000 100 0.7408875
## 0.3 4 0.6 0.5000000 150 0.7389230
## 0.3 4 0.6 0.5000000 200 0.7404929
## 0.3 4 0.6 0.6666667 50 0.7524742
## 0.3 4 0.6 0.6666667 100 0.7457182
## 0.3 4 0.6 0.6666667 150 0.7423856
## 0.3 4 0.6 0.6666667 200 0.7397532
## 0.3 4 0.6 0.8333333 50 0.7505011
## 0.3 4 0.6 0.8333333 100 0.7392299
## 0.3 4 0.6 0.8333333 150 0.7389005
## 0.3 4 0.6 0.8333333 200 0.7355946
## 0.3 4 0.6 1.0000000 50 0.7524177
## 0.3 4 0.6 1.0000000 100 0.7458274
## 0.3 4 0.6 1.0000000 150 0.7430005
## 0.3 4 0.6 1.0000000 200 0.7395298
## 0.3 4 0.8 0.5000000 50 0.7421233
## 0.3 4 0.8 0.5000000 100 0.7333384
## 0.3 4 0.8 0.5000000 150 0.7332263
## 0.3 4 0.8 0.5000000 200 0.7317416
## 0.3 4 0.8 0.6666667 50 0.7372878
## 0.3 4 0.8 0.6666667 100 0.7342650
## 0.3 4 0.8 0.6666667 150 0.7343053
## 0.3 4 0.8 0.6666667 200 0.7312627
## 0.3 4 0.8 0.8333333 50 0.7500758
## 0.3 4 0.8 0.8333333 100 0.7456205
## 0.3 4 0.8 0.8333333 150 0.7405931
## 0.3 4 0.8 0.8333333 200 0.7390485
## 0.3 4 0.8 1.0000000 50 0.7514692
## 0.3 4 0.8 1.0000000 100 0.7450334
## 0.3 4 0.8 1.0000000 150 0.7430389
## 0.3 4 0.8 1.0000000 200 0.7416189
## 0.4 1 0.6 0.5000000 50 0.7722520
## 0.4 1 0.6 0.5000000 100 0.7655410
## 0.4 1 0.6 0.5000000 150 0.7586266
## 0.4 1 0.6 0.5000000 200 0.7521452
## 0.4 1 0.6 0.6666667 50 0.7727406
## 0.4 1 0.6 0.6666667 100 0.7635691
## 0.4 1 0.6 0.6666667 150 0.7592640
## 0.4 1 0.6 0.6666667 200 0.7556049
## 0.4 1 0.6 0.8333333 50 0.7755629
## 0.4 1 0.6 0.8333333 100 0.7647290
## 0.4 1 0.6 0.8333333 150 0.7620215
## 0.4 1 0.6 0.8333333 200 0.7565957
## 0.4 1 0.6 1.0000000 50 0.7807951
## 0.4 1 0.6 1.0000000 100 0.7731696
## 0.4 1 0.6 1.0000000 150 0.7679442
## 0.4 1 0.6 1.0000000 200 0.7658862
## 0.4 1 0.8 0.5000000 50 0.7690274
## 0.4 1 0.8 0.5000000 100 0.7563383
## 0.4 1 0.8 0.5000000 150 0.7464715
## 0.4 1 0.8 0.5000000 200 0.7425717
## 0.4 1 0.8 0.6666667 50 0.7753294
## 0.4 1 0.8 0.6666667 100 0.7641090
## 0.4 1 0.8 0.6666667 150 0.7582126
## 0.4 1 0.8 0.6666667 200 0.7496569
## 0.4 1 0.8 0.8333333 50 0.7772010
## 0.4 1 0.8 0.8333333 100 0.7667864
## 0.4 1 0.8 0.8333333 150 0.7592413
## 0.4 1 0.8 0.8333333 200 0.7563944
## 0.4 1 0.8 1.0000000 50 0.7826603
## 0.4 1 0.8 1.0000000 100 0.7741028
## 0.4 1 0.8 1.0000000 150 0.7716598
## 0.4 1 0.8 1.0000000 200 0.7674480
## 0.4 2 0.6 0.5000000 50 0.7488612
## 0.4 2 0.6 0.5000000 100 0.7439722
## 0.4 2 0.6 0.5000000 150 0.7362162
## 0.4 2 0.6 0.5000000 200 0.7300281
## 0.4 2 0.6 0.6666667 50 0.7645254
## 0.4 2 0.6 0.6666667 100 0.7527623
## 0.4 2 0.6 0.6666667 150 0.7397998
## 0.4 2 0.6 0.6666667 200 0.7370224
## 0.4 2 0.6 0.8333333 50 0.7593424
## 0.4 2 0.6 0.8333333 100 0.7460863
## 0.4 2 0.6 0.8333333 150 0.7392883
## 0.4 2 0.6 0.8333333 200 0.7345042
## 0.4 2 0.6 1.0000000 50 0.7686791
## 0.4 2 0.6 1.0000000 100 0.7509756
## 0.4 2 0.6 1.0000000 150 0.7427852
## 0.4 2 0.6 1.0000000 200 0.7377539
## 0.4 2 0.8 0.5000000 50 0.7553110
## 0.4 2 0.8 0.5000000 100 0.7436727
## 0.4 2 0.8 0.5000000 150 0.7457551
## 0.4 2 0.8 0.5000000 200 0.7329734
## 0.4 2 0.8 0.6666667 50 0.7533077
## 0.4 2 0.8 0.6666667 100 0.7452239
## 0.4 2 0.8 0.6666667 150 0.7397192
## 0.4 2 0.8 0.6666667 200 0.7356257
## 0.4 2 0.8 0.8333333 50 0.7585926
## 0.4 2 0.8 0.8333333 100 0.7414705
## 0.4 2 0.8 0.8333333 150 0.7386803
## 0.4 2 0.8 0.8333333 200 0.7371954
## 0.4 2 0.8 1.0000000 50 0.7662917
## 0.4 2 0.8 1.0000000 100 0.7485021
## 0.4 2 0.8 1.0000000 150 0.7412513
## 0.4 2 0.8 1.0000000 200 0.7362407
## 0.4 3 0.6 0.5000000 50 0.7408208
## 0.4 3 0.6 0.5000000 100 0.7280308
## 0.4 3 0.6 0.5000000 150 0.7284110
## 0.4 3 0.6 0.5000000 200 0.7302277
## 0.4 3 0.6 0.6666667 50 0.7384999
## 0.4 3 0.6 0.6666667 100 0.7329009
## 0.4 3 0.6 0.6666667 150 0.7348765
## 0.4 3 0.6 0.6666667 200 0.7333970
## 0.4 3 0.6 0.8333333 50 0.7516424
## 0.4 3 0.6 0.8333333 100 0.7435843
## 0.4 3 0.6 0.8333333 150 0.7358943
## 0.4 3 0.6 0.8333333 200 0.7314122
## 0.4 3 0.6 1.0000000 50 0.7559548
## 0.4 3 0.6 1.0000000 100 0.7437773
## 0.4 3 0.6 1.0000000 150 0.7401895
## 0.4 3 0.6 1.0000000 200 0.7365360
## 0.4 3 0.8 0.5000000 50 0.7353193
## 0.4 3 0.8 0.5000000 100 0.7248205
## 0.4 3 0.8 0.5000000 150 0.7257698
## 0.4 3 0.8 0.5000000 200 0.7246873
## 0.4 3 0.8 0.6666667 50 0.7445538
## 0.4 3 0.8 0.6666667 100 0.7380094
## 0.4 3 0.8 0.6666667 150 0.7375367
## 0.4 3 0.8 0.6666667 200 0.7366888
## 0.4 3 0.8 0.8333333 50 0.7458574
## 0.4 3 0.8 0.8333333 100 0.7385728
## 0.4 3 0.8 0.8333333 150 0.7350573
## 0.4 3 0.8 0.8333333 200 0.7349095
## 0.4 3 0.8 1.0000000 50 0.7557012
## 0.4 3 0.8 1.0000000 100 0.7425303
## 0.4 3 0.8 1.0000000 150 0.7396849
## 0.4 3 0.8 1.0000000 200 0.7379944
## 0.4 4 0.6 0.5000000 50 0.7334725
## 0.4 4 0.6 0.5000000 100 0.7336885
## 0.4 4 0.6 0.5000000 150 0.7316072
## 0.4 4 0.6 0.5000000 200 0.7313384
## 0.4 4 0.6 0.6666667 50 0.7392119
## 0.4 4 0.6 0.6666667 100 0.7301911
## 0.4 4 0.6 0.6666667 150 0.7301357
## 0.4 4 0.6 0.6666667 200 0.7293951
## 0.4 4 0.6 0.8333333 50 0.7409025
## 0.4 4 0.6 0.8333333 100 0.7375138
## 0.4 4 0.6 0.8333333 150 0.7350785
## 0.4 4 0.6 0.8333333 200 0.7326485
## 0.4 4 0.6 1.0000000 50 0.7469148
## 0.4 4 0.6 1.0000000 100 0.7429984
## 0.4 4 0.6 1.0000000 150 0.7411131
## 0.4 4 0.6 1.0000000 200 0.7393511
## 0.4 4 0.8 0.5000000 50 0.7304144
## 0.4 4 0.8 0.5000000 100 0.7275738
## 0.4 4 0.8 0.5000000 150 0.7241343
## 0.4 4 0.8 0.5000000 200 0.7229221
## 0.4 4 0.8 0.6666667 50 0.7365866
## 0.4 4 0.8 0.6666667 100 0.7319421
## 0.4 4 0.8 0.6666667 150 0.7296644
## 0.4 4 0.8 0.6666667 200 0.7266519
## 0.4 4 0.8 0.8333333 50 0.7439798
## 0.4 4 0.8 0.8333333 100 0.7399107
## 0.4 4 0.8 0.8333333 150 0.7377882
## 0.4 4 0.8 0.8333333 200 0.7358172
## 0.4 4 0.8 1.0000000 50 0.7528588
## 0.4 4 0.8 1.0000000 100 0.7482183
## 0.4 4 0.8 1.0000000 150 0.7437316
## 0.4 4 0.8 1.0000000 200 0.7406688
## Sens Spec
## (the sensitivity and specificity columns for all 256 tuning combinations
## are truncated from the printed output; both ranged roughly 0.64-0.72)
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 50, max_depth = 1,
## eta = 0.3, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1
## and subsample = 1.
xgb_fit$finalModel
## ##### xgb.Booster
## raw: 13.1 Kb
## call:
## xgboost::xgb.train(params = list(eta = param$eta, max_depth = param$max_depth,
## gamma = param$gamma, colsample_bytree = param$colsample_bytree,
## min_child_weight = param$min_child_weight, subsample = param$subsample),
## data = x, nrounds = param$nrounds, objective = "binary:logistic")
## params (as set within xgb.train):
## eta = "0.3", max_depth = "1", gamma = "0", colsample_bytree = "0.8", min_child_weight = "1", subsample = "1", objective = "binary:logistic", silent = "1"
## xgb.attributes:
## niter
## callbacks:
## cb.print.evaluation(period = print_every_n)
## # of features: 10
## niter: 50
## nfeatures : 10
## xNames : age los glucose_num sodium_num wbc_count_num calcium_num hemoglobin_num creatinine_num urea_nitrogen_num chloride_num
## problemType : Classification
## tuneValue :
## nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 29 50 1 0.3 0 0.8 1 1
## obsLevels : Alive dead
## param :
## list()
test_pred_xgb <- predict(xgb_fit, newdata = im_test)
confusionMatrix(test_pred_xgb, im_test$status )
## Confusion Matrix and Statistics
##
## Reference
## Prediction Alive dead
## Alive 137 63
## dead 61 138
##
## Accuracy : 0.6892
## 95% CI : (0.6413, 0.7343)
## No Information Rate : 0.5038
## P-Value [Acc > NIR] : 4.575e-14
##
## Kappa : 0.3785
##
## Mcnemar's Test P-Value : 0.9284
##
## Sensitivity : 0.6919
## Specificity : 0.6866
## Pos Pred Value : 0.6850
## Neg Pred Value : 0.6935
## Prevalence : 0.4962
## Detection Rate : 0.3434
## Detection Prevalence : 0.5013
## Balanced Accuracy : 0.6892
##
## 'Positive' Class : Alive
##
par(pty = "s")
test_pred_xgb2 <- predict(xgb_fit, newdata = im_test,type="prob")
roc(im_test$status,test_pred_xgb2[,2], plot=T,legacy.axes = T, col="black",
main="ROC Curve of XGBoost Model", xlab="FPR", ylab = "TPR",
print.auc = T, print.auc.x=0.4,print.auc.y=0.3)
##
## Call:
## roc.default(response = im_test$status, predictor = test_pred_xgb2[, 2], plot = T, legacy.axes = T, col = "black", main = "ROC Curve of XGBoost Model", xlab = "FPR", ylab = "TPR", print.auc = T, print.auc.x = 0.4, print.auc.y = 0.3)
##
## Data: test_pred_xgb2[, 2] in 198 controls (im_test$status Alive) < 201 cases (im_test$status dead).
## Area under the curve: 0.7531
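For the boosted model, per-feature importance can be extracted from the underlying xgboost booster; a sketch assuming the `xgb_fit` object above (caret stores the predictor names on `finalModel$xNames`):

```r
# Gain-based feature importance from the final xgb.Booster.
library(xgboost)
xgb_imp <- xgb.importance(feature_names = xgb_fit$finalModel$xNames,
                          model = xgb_fit$finalModel)
head(xgb_imp)                       # Feature, Gain, Cover, Frequency
xgb.plot.importance(xgb_imp, top_n = 10)
```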
par(pty = "s")
roc.curve(im_test$status,p_log)
## Area under the curve (AUC): 0.748
roc.curve(im_test$status,test_pred_sl2[,2], add=TRUE, col=2,
lwd=2, lty=2)
## Area under the curve (AUC): 0.747
roc.curve(im_test$status,test_pred_sp2[,2], add=TRUE, col=3,
lwd=2, lty=3)
## Area under the curve (AUC): 0.742
roc.curve(im_test$status,test_pred_sr2[,2], add=TRUE, col=4,
lwd=2, lty=4)
## Area under the curve (AUC): 0.746
roc.curve(im_test$status,test_pred_dt2[,2], add=TRUE, col=5,
lwd=2, lty=5)
## Area under the curve (AUC): 0.700
roc.curve(im_test$status,test_pred_rf2[,2], add=TRUE, col=6,
lwd=2, lty=6)
## Area under the curve (AUC): 0.761
roc.curve(im_test$status,test_pred_xgb2[,2],add=TRUE, col=7,
lwd=2, lty=7)
## Area under the curve (AUC): 0.754
legend("bottomright",
c("Logistic Regression", "SVM Linear", "SVM Poly", "SVM RBF", "Decision tree",
"Random Forest","XGBoost"),
col=1:7, lty=1:7, lwd=2,cex = 0.5)
par(pty = "m")
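The AUC gaps in the overlay above are small, and whether, say, the random forest truly outperforms the decision tree on this test set can be checked with DeLong's test for paired ROC curves; a sketch using `pROC` and the prediction objects above:

```r
# DeLong's test for correlated ROC curves (same test set):
# H0: AUC(random forest) == AUC(decision tree).
library(pROC)
roc_rf <- roc(im_test$status, test_pred_rf2[, 2])
roc_dt <- roc(im_test$status, test_pred_dt2[, 2])
roc.test(roc_rf, roc_dt, method = "delong")
```

A large p-value here would suggest the apparent AUC ordering could be sampling noise rather than a real performance difference.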
Model                 AUC    Sensitivity  Specificity
Logistic regression   0.748  0.84         0.50
SVM Linear            0.747  0.69         0.71
SVM Polynomial        0.742  0.71         0.68
SVM RBF               0.746  0.67         0.69
Decision Tree         0.700  0.55         0.80
Random Forest         0.761  0.63         0.75
XGBoost               0.754  0.69         0.69
Although the decision tree has the lowest AUC of these models, its specificity of 0.80 means it correctly identifies 80% of the patients who died, which makes it practical for clinical use. The tree plot also indicates that age, glucose, wbc_count and urea_nitrogen can serve as predictors of heart failure mortality, a result consistent with domain knowledge, and the tree diagram itself is easy to present and explain. For these reasons I would choose the decision tree rather than logistic regression as the inference model.
As predictive models, random forest and XGBoost achieve similar AUC values, both higher than the three SVM models, but 0.76 is still not a strong result. A likely reason is that the dataset contains too few observations; machine learning methods generally need far more data to reach better performance.
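One way to probe the small-sample explanation is a learning curve: refit the favored model on growing fractions of the training data and track test AUC. A manual sketch, assuming `im_train`, `im_test`, caret and pROC are available as above (the fractions and `tuneLength` are illustrative choices, not values from this report):

```r
# Illustrative learning curve for a random forest: if test AUC is still
# rising at the largest fraction, more data would likely help.
library(caret)
library(pROC)
fracs <- seq(0.3, 0.9, by = 0.1)
aucs <- sapply(fracs, function(f) {
  set.seed(3233)
  idx <- createDataPartition(im_train$status, p = f, list = FALSE)
  fit <- train(status ~ ., data = im_train[idx, ], method = "rf",
               trControl = trainControl(method = "cv", number = 5,
                                        classProbs = TRUE,
                                        summaryFunction = twoClassSummary),
               metric = "ROC", tuneLength = 3)
  probs <- predict(fit, newdata = im_test, type = "prob")
  as.numeric(auc(im_test$status, probs[, 2]))
})
plot(fracs, aucs, type = "b", xlab = "Fraction of training data used",
     ylab = "Test AUC", main = "Learning Curve (Random Forest)")
```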